A key point problem is a variant of image regression in which a "key point" refers to a specific location within an image.
In this tutorial, we'll look for the centre of the person's face in each image, predicting two values for each image: the row and column of the face centre. The full fastai tutorial can be found in their notebook manual:
This problem can then be expanded to find the centroid of an asteroid!
Before starting, you will need to set Colab to use the GPU:
The fastai2 library is not pre-installed in Colab, so we first need to pip install it. Using "!" at the beginning of a cell in a Jupyter notebook runs a shell command, installing the library directly on the Colab virtual machine. You only need to rerun this when restarting the kernel.
!pip3 install fastai2
Any machine learning task that involves understanding images falls into the area of computer vision. In fastai, all the modules related to computer vision can be found under fastai.vision.
from fastai2 import *
from fastai2.vision.all import *
import numpy as np
import pandas as pd
We will use a dataset hosted by fastai for this tutorial, which contains images (.jpg files) and, for each image, the coordinates of the centre of the face in a corresponding pose.txt file. The images are grouped into 24 directories, each containing photographs of a different person.
path = untar_data(URLs.BIWI_HEAD_POSE)
path.ls()
Path.BASE_PATH = path
path.ls().sorted()
(path/'01').ls().sorted()
img_files = get_image_files(path)
im = PILImage.create(img_files[0])
im.shape
im.to_thumb(254)
For each image, we need to be able to load in the location of the centre of the head.
def img2pose(x): return Path(f'{str(x)[:-7]}pose.txt')
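To see what img2pose does, here is a standalone sketch (the example filename follows the dataset's frame_XXXXX_rgb.jpg naming pattern and is illustrative):

```python
from pathlib import Path

def img2pose(x):
    # drop the trailing 'rgb.jpg' (7 characters) and append 'pose.txt'
    return Path(f'{str(x)[:-7]}pose.txt')

print(img2pose(Path('01/frame_00003_rgb.jpg')))  # → 01/frame_00003_pose.txt
```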
cal = np.genfromtxt(path/'01'/'rgb.cal', skip_footer=6)
def get_ctr(f):
    ctr = np.genfromtxt(img2pose(f), skip_header=3)
    c1 = ctr[0] * cal[0][0]/ctr[2] + cal[0][2]
    c2 = ctr[1] * cal[1][1]/ctr[2] + cal[1][2]
    return tensor([c1,c2])
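get_ctr implements the standard pinhole-camera projection, u = f·X/Z + c: the 3D head centre (X, Y, Z) from the pose file is projected into pixel coordinates using the focal lengths and principal point stored in the calibration matrix. A self-contained sketch with made-up calibration values:

```python
import numpy as np

# hypothetical 3x3 intrinsic matrix: focal lengths on the diagonal,
# principal point in the last column (values invented for illustration)
cal = np.array([[500.0,   0.0, 320.0],
                [  0.0, 500.0, 240.0],
                [  0.0,   0.0,   1.0]])

ctr = np.array([0.25, 0.125, 1.0])  # head centre in camera coordinates (X, Y, Z)

c1 = ctr[0] * cal[0][0] / ctr[2] + cal[0][2]  # image column
c2 = ctr[1] * cal[1][1] / ctr[2] + cal[1][2]  # image row
print(c1, c2)  # → 445.0 302.5
```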
For fastai2, we need to load each image (ImageBlock) and head coordinate (PointBlock) into a specific DataBlock format:
batch_tfms = [*aug_transforms(size=(240,320)), Normalize.from_stats(*imagenet_stats)]
dblock = DataBlock(blocks=(ImageBlock, PointBlock),
                   get_items=get_image_files,
                   get_y=get_ctr,
                   splitter=FuncSplitter(lambda o: o.parent.name=='13'),
                   batch_tfms=batch_tfms)
# debugging the DataBlock
#dblock.summary('')
We then call the DataBlock on our image set at path and choose a batch size. Larger batch sizes use more GPU memory but give smoother gradient estimates and faster epochs; if your images are too large to fit a reasonable batch in memory, you will have to choose a smaller batch size, which is why we resized them above.
dls = dblock.dataloaders(path, bs=16)
print("Number of images in the training vs validation sets: {}, {}".format(dls.train.n, dls.valid.n))
Check whether the data looks ok before training:
dls.show_batch(max_n=9, figsize=(10,10))
What are the dimensions of your dataset?
xb,yb = dls.one_batch()
xb.shape,yb.shape
im = image2tensor(Image.open(img_files[0]))
_, axs = subplots(1, 3)
for i, ax, color in zip(im, axs, ('Reds', 'Greens', 'Blues')):
    show_image(255-i, ax=ax, cmap=color)
As we are working with images, we will make use of transfer learning to improve our accuracy and convergence speed by using a pre-trained model.
Here, we make use of ResNet, a classic neural network used as a backbone for many computer vision tasks. This model was the winner of the ImageNet challenge in 2015. The ImageNet challenge was a classification task over thousands of image categories and millions of images.
With transfer learning, we can take this pre-trained classification model and use it for a task that is different to what it was originally trained for: our regression problem.
We keep and freeze the first layers of the network, which have been trained on the large ImageNet dataset. These early layers have already learnt what edges and colours look like. Later layers of this network have learnt the specific characteristics of, e.g., dog breeds, which we are not interested in, so we remove these layers and replace them with random weights.
Then, we only optimise these last few layers, which are tailored towards our smaller specific dataset. This approach is faster and requires much less data than learning from scratch.
First, we set up a learner object:
We use a convolutional neural network (CNN) with weights pre-trained on ImageNet data (from ResNet). This function is designed for transfer learning and initialises the weights of the final layers (the head) randomly.
Since we're predicting a continuous number, rather than a category, we have to tell fastai what range our target has, using the y_range parameter.
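Under the hood, fastai applies y_range with a scaled sigmoid on the final activation (its SigmoidRange module); a minimal NumPy sketch of the idea:

```python
import numpy as np

def sigmoid_range(x, lo, hi):
    # squash an unbounded activation into the interval (lo, hi)
    return 1 / (1 + np.exp(-x)) * (hi - lo) + lo

print(sigmoid_range(0.0, -1, 1))   # → 0.0 (midpoint of the range)
print(sigmoid_range(10.0, -1, 1))  # close to 1, never beyond it
```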
We can specify which metrics we are interested in for evaluating how the model performs on the validation dataset. Available built in metrics can be found in: https://github.com/fastai/fastai2/blob/master/fastai2/metrics.py.
learn = cnn_learner(dls, resnet18, y_range=(-1,1), metrics=[mse, mae]) # try resnet18, resnet34, resnet50
Here, the default loss function is the mean square error (typical for regression problems).
dls.loss_func
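As a reminder, mean squared error is just the mean of the squared differences between predictions and targets; a quick NumPy check with illustrative numbers:

```python
import numpy as np

true = np.array([[0.1, 0.2], [0.3, 0.4]])  # target (col, row) pairs
pred = np.array([[0.1, 0.0], [0.3, 0.2]])  # model predictions

mse = np.mean((pred - true) ** 2)
print(mse)  # ≈ 0.02
```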
Choosing your learning rate:
The learning rate is often one of the most important parameters as it is used to define the step size in the optimisation:
Good choices lie where the loss curve is decreasing steeply, before it bottoms out.
Human in the loop: the shape of the learning rate finder curve may be such that the automatically suggested values are not the best for your problem, so always inspect the plot.
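To make the role of the learning rate concrete, a single plain gradient-descent step (a sketch, not fastai's actual optimiser) looks like this:

```python
# minimise f(w) = w**2, whose gradient is 2*w
w, lr = 3.0, 0.1

grad = 2 * w
w = w - lr * grad  # the step size scales with lr
print(w)  # → 2.4
```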
suggested_lrs = learn.lr_find()
suggested_lrs
# choose learning rate based on lr_finder
lr = suggested_lrs.lr_steep
Fine tuning is the fastai helper for transfer learning. First, the pre-trained ResNet body is frozen and only the randomly initialised head is trained for one epoch. Then the whole network is unfrozen and trained for the requested number of epochs, with smaller learning rates for the earlier layers.
learn.fine_tune(epochs=3, base_lr=lr)
# choose learning rate based on lr_finder
lr = 5e-3
Fastai uses callbacks to help you customise your training: here we want to save the best model seen during training and reload it at the end.
# define callback to save best model
cb = SaveModelCallback(fname="best")
learn.fine_tune(epochs=2, base_lr=lr, cbs=[cb])
print("Best MSE (loss): {}".format(cb.best))
print("Prediction error: {:.4}%".format(np.sqrt(cb.best)*100))
Beware of overfitting, which shows up when the validation loss starts to increase while the training loss keeps falling.
learn.recorder.plot_loss()
Fastai implements a learning rate scheme called "fit one cycle", where the learning rates are varied over the course of training to improve convergence. This includes a warm-up stage, where the learning rates are smaller at the beginning of training, before building up to a maximum learning rate and then decaying towards the end. For best results, these smaller learning rates at the end of training should not be wasted on epochs which are overfitting.
learn.recorder.plot_sched()
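The shape of that schedule can be sketched in plain NumPy (a rough cosine warm-up/decay approximation, assuming fastai's defaults of lr_max/25 at the start and a 25% warm-up; not fastai's exact implementation):

```python
import numpy as np

def one_cycle(step, total, lr_max, pct_start=0.25, div=25.0, div_final=1e5):
    # cosine-anneal up from lr_max/div to lr_max, then down to lr_max/div_final
    warm = int(total * pct_start)
    if step < warm:
        t, lo, hi = step / warm, lr_max / div, lr_max
    else:
        t, lo, hi = (step - warm) / (total - warm), lr_max, lr_max / div_final
    return lo + (hi - lo) * (1 - np.cos(np.pi * t)) / 2

lrs = [one_cycle(s, 100, 1e-2) for s in range(101)]
# the peak sits at the end of the warm-up, 25% of the way through training
```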
learn.show_results(ds_idx=1)  # figsize=(6,8)
Will using deep learning lead to a measurable improvement over more traditional, or more easily interpretable, methods?
Here we calculate the predictive accuracy of a dummy baseline, which predicts that the centre of the face is the centre of the image, for comparison with our ML results.
Retrieve true coordinates of centre of face from the validation data set:
valid_img_files = get_image_files((path/"13"))
true = np.array([get_ctr(f).numpy() for f in valid_img_files])
true[:5]
Predicted values as the central pixel of each image:
im = PILImage.create(valid_img_files[0])
xshape, yshape = im.shape[0], im.shape[1]
xshape, yshape
# get_ctr returns (column, row) while im.shape is (rows, columns), so the centre is (0.5*yshape, 0.5*xshape)
pred = np.vstack((np.array([0.5 * yshape]*len(true)), np.array([0.5 * xshape]*len(true)))).T
pred[:5]
# normalise by the image width (yshape) so the result is comparable to the model's loss
mse = np.average((true/yshape - pred/yshape) ** 2)
print("Baseline best MSE: {}".format(mse))
print("Prediction error: {:.4}%".format(np.sqrt(mse)*100))
Plot the cases where the model performs worst (and best) to try to understand what the model is deficient in, and successful at. Is it cheating? Or learning as we would expect it to?
interp = Interpretation.from_learner(learn)
def fplot_top_losses(interp, dls, k=4, largest=True):
    """
    Plot the k validation images with the highest losses
    (worst cases: largest=True) or the lowest losses
    (best cases: largest=False), and return their file paths.
    """
    # retrieve the top losses and their indices
    losses = interp.top_losses(k, largest)
    # get the corresponding image files
    imgs = [dls.valid.items[i] for i in losses.indices]
    # plot the image files
    dls.test_dl([PILImage.create(i) for i in imgs]).show_batch()
    return imgs
worst = fplot_top_losses(interp, dls, k=4, largest=True)
best = fplot_top_losses(interp, dls, k=4, largest=False)
Extension activity!
Class Activation Mapping (or CAM) is a common technique in computer vision for "explainable AI". It maps the importance of each input pixel with respect to changes in the output activations. It is normally visualised as a heatmap over the image, highlighting the parts most important for the prediction.
To access the activations inside the model while it's training, we need to use PyTorch hooks. For the full tutorial and explanation, see:
class Hook():
    def __init__(self, m):
        self.hook = m.register_forward_hook(self.hook_func)
    def hook_func(self, m, i, o): self.stored = o.detach().clone()
    def __enter__(self, *args): return self
    def __exit__(self, *args): self.hook.remove()
def fshow_cam(learn, dls, x):
    """
    plot class activation map
    """
    with Hook(learn.model[0]) as hook:
        with torch.no_grad(): output = learn.model.eval()(x.cuda())
        act = hook.stored[0]
    cam_map = torch.einsum('ck,kij->cij', learn.model[1][-2].weight, act)
    x_dec = TensorImage(dls.valid.decode((x,))[0][0])
    fig, ax = plt.subplots()
    x_dec.show(ctx=ax)
    im = ax.imshow(cam_map[1].detach().cpu(), alpha=0.4,
                   extent=(0, x.shape[3], x.shape[2], 0),
                   interpolation='bilinear', cmap='jet')
    fig.colorbar(im)
    plt.show()
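The torch.einsum('ck,kij->cij', ...) call above contracts the head's weight matrix (one row per output, one column per feature channel) against the stacked k×i×j activation maps, yielding one i×j importance map per output coordinate. A NumPy analogue on dummy data (the shapes are illustrative):

```python
import numpy as np

w = np.random.rand(2, 512)       # head weights: 2 outputs x 512 channels
act = np.random.rand(512, 8, 8)  # final conv activations: 512 maps of 8x8

cam = np.einsum('ck,kij->cij', w, act)
print(cam.shape)  # → (2, 8, 8)

# each output's map is a weighted sum of the channel maps
assert np.allclose(cam[0], np.tensordot(w[0], act, axes=1))
```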
for img in worst:
    im = PILImage.create(img)
    x, = first(dls.test_dl([im]))
    fshow_cam(learn, dls, x)
for img in best:
    im = PILImage.create(img)
    x, = first(dls.test_dl([im]))
    fshow_cam(learn, dls, x)